power swing
Power Stabilization for AI Training Datacenters
Choukse, Esha, Warrier, Brijesh, Heath, Scot, Belmont, Luz, Zhao, April, Khan, Hassan Ali, Harry, Brian, Kappel, Matthew, Hewett, Russell J., Datta, Kushal, Pei, Yu, Lichtenberger, Caroline, Siegler, John, Lukofsky, David, Kahn, Zaid, Sahota, Gurpreet, Sullivan, Andy, Frederick, Charles, Thai, Hien, Naughton, Rebecca, Jurnove, Daniel, Harp, Justin, Carper, Reid, Mahalingam, Nithish, Varkala, Srini, Kumbhare, Alok Gautam, Desai, Satyajit, Ramamurthy, Venkatesh, Gottumukkala, Praneeth, Bhatia, Girish, Wildstone, Kelsey, Olariu, Laurentiu, Incorvaia, Ileana, Wetmore, Alex, Ram, Prabhat, Raghuraman, Melur, Ayna, Mohammed, Kendrick, Mike, Bianchini, Ricardo, Hurst, Aaron, Zamani, Reza, Li, Xin, Petrov, Michael, Oden, Gene, Carmichael, Rory, Li, Tom, Gupta, Apoorv, Patel, Pratikkumar, Dattani, Nilesh, Marwong, Lawrence, Nertney, Rob, Kobayashi, Hirofumi, Liott, Jeff, Enev, Miro, Ramakrishnan, Divya, Buck, Ian, Alben, Jonah
Large Artificial Intelligence (AI) training workloads spanning several tens of thousands of GPUs present unique power management challenges. These arise from the high variability in power consumption during training. Because these jobs are synchronous, every iteration has a computation-heavy phase, in which each GPU works on its local data, and a communication-heavy phase, in which all the GPUs synchronize on the data. Compute-heavy phases require much more power than communication phases, so large power swings occur. The amplitude of these power swings keeps growing as training jobs grow in size. An even bigger challenge arises from the frequency spectrum of these power swings, which, if aligned with critical frequencies of utilities, can cause physical damage to the power grid infrastructure. Therefore, to continue scaling AI training workloads safely, we need to stabilize the power of such workloads. This paper introduces the challenge with production data and explores innovative solutions across the stack: software, GPU hardware, and datacenter infrastructure. We present the pros and cons of each of these approaches and then propose a multi-pronged approach to solving the challenge. The proposed solutions are rigorously tested using a combination of real hardware and Microsoft's in-house cloud power simulator, providing critical insights into the efficacy of these interventions under real-world conditions.
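The compute/communication alternation described in the abstract can be illustrated with a small synthetic model. The sketch below is not the paper's simulator or data: the power levels, phase durations, and function names are all hypothetical, and the trace is idealized as a square wave. It only shows how the swing amplitude and the dominant swing frequency follow from the iteration timing.

```python
import cmath
import math


def synthetic_power_trace(p_compute_mw, p_comm_mw, t_compute_s, t_comm_s,
                          n_iterations, sample_rate_hz):
    """Idealized square-wave trace: high power in compute, low in communication.

    All parameters are illustrative assumptions, not measured values.
    """
    compute_samples = round(t_compute_s * sample_rate_hz)
    samples_per_iter = round((t_compute_s + t_comm_s) * sample_rate_hz)
    trace = []
    for k in range(n_iterations * samples_per_iter):
        in_compute = (k % samples_per_iter) < compute_samples
        trace.append(p_compute_mw if in_compute else p_comm_mw)
    return trace


def dominant_frequency_hz(trace, sample_rate_hz):
    """Return the frequency of the largest non-DC component (naive DFT)."""
    n = len(trace)
    mean = sum(trace) / n
    centered = [x - mean for x in trace]  # remove the DC offset
    best_bin, best_mag = 0, 0.0
    for b in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * b * k / n)
                    for k, x in enumerate(centered))
        if abs(coeff) > best_mag:
            best_bin, best_mag = b, abs(coeff)
    return best_bin * sample_rate_hz / n


# Hypothetical job: 1.5 s of compute at 100 MW, 0.5 s of communication at 40 MW.
trace = synthetic_power_trace(100.0, 40.0, 1.5, 0.5,
                              n_iterations=10, sample_rate_hz=20)
swing_mw = max(trace) - min(trace)
freq_hz = dominant_frequency_hz(trace, sample_rate_hz=20)
print(f"swing amplitude: {swing_mw} MW, dominant frequency: {freq_hz} Hz")
```

In this toy model the dominant frequency is the reciprocal of the iteration time, which is why the iteration period of a synchronous job determines where its power swings land in the frequency spectrum.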
Data-driven Protection of Transformers, Phase Angle Regulators, and Transmission Lines in Interconnected Power Systems
This dissertation highlights the growing interest in and adoption of machine learning (ML) approaches for fault detection in modern power grids. Once a fault has occurred, it must be identified quickly, and steps must be taken to remove or isolate it. Detecting, locating, and classifying faults early and accurately can therefore improve safety and dependability while reducing downtime and hardware damage. ML-based solutions and tools that carry out effective data processing and analysis to aid power system operations and decision-making are becoming prominent as system condition awareness and data availability improve. Power transformers, Phase Shift Transformers or Phase Angle Regulators, and transmission lines are critical components in power systems, and ensuring their safety is a primary concern. Differential relays are commonly employed to protect transformers, whereas distance relays are used to protect transmission lines. Magnetizing inrush, overexcitation, and current transformer saturation make transformer protection a challenge. Furthermore, non-standard phase shift, series core saturation, and low turn-to-turn and turn-to-ground fault currents are non-traditional problems associated with Phase Angle Regulators. Faults during symmetrical power swings and unstable power swings may cause mal-operation of distance relays and unintentional, uncontrolled islanding. Distance relays also mal-operate for transmission lines connected to type-3 wind farms. Conventional protection techniques are no longer adequate to address these challenges because of their limitations in handling and analyzing massive amounts of data, their limited generalizability, and their inability to model non-linear systems. These limitations of differential and distance protection methods motivate the use of ML to address various protection challenges.
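The link between power swings and distance-relay mal-operation mentioned above can be sketched with a textbook two-machine equivalent. This is not the dissertation's method or test system: the model, the per-unit impedances, the mho characteristic, and the function names are all assumptions chosen for illustration. The relay measures apparent impedance Z = V/I at its terminal, and as the angle between the two sources grows during a swing, that impedance can move inside the protection zone even though no fault exists.

```python
import cmath
import math


def apparent_impedance(delta_deg, e_mag=1.0,
                       z_source_a=0.1j, z_line=0.3j, z_source_b=0.1j):
    """Impedance seen by a relay at the A end of the line (per unit).

    Assumed two-machine model: sources of equal magnitude e_mag separated
    by angle delta_deg. delta_deg must be non-zero, otherwise no current flows.
    """
    e_a = cmath.rect(e_mag, math.radians(delta_deg))
    e_b = cmath.rect(e_mag, 0.0)
    current = (e_a - e_b) / (z_source_a + z_line + z_source_b)
    v_relay = e_a - current * z_source_a  # voltage at the relay bus
    return v_relay / current


def inside_mho_zone(z, reach):
    """True if z lies inside a mho circle through the origin with diameter `reach`."""
    center = reach / 2
    return abs(z - center) <= abs(reach) / 2


# Hypothetical zone 1 reach: 80% of the assumed line impedance.
zone1_reach = 0.8 * 0.3j

for delta in (30, 90, 150, 180):
    z = apparent_impedance(delta)
    print(f"delta={delta:3d} deg  Z={z:.3f}  in zone 1: {inside_mho_zone(z, zone1_reach)}")
```

At small angles the apparent impedance is dominated by a large resistive component and stays well outside the zone; near 180 degrees it approaches the electrical center of the system, which in this assumed setup falls on the protected line.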